Code Author: Lance Wrobel

In-progess individual project on unsupervised learning methods applied to online retail data.

Methods used: Association Rule Mining, Kmeans Clustering, Hierarchical Clustering

The dataset used is from the UCI Machine Learning Repository, the link is below: http://archive.ics.uci.edu/ml/datasets/Online+Retail

source("RetailAnalysisFunctions.R") # contains the functions I wrote which are called from this notebook
retail_dataset <- read.csv("OnlineRetail.csv")

retail_dataset <- as_tibble(retail_dataset) # makes it easier to print the dataset and check columns

colnames(retail_dataset)[[1]] <- "InvoiceNumber" # this column name was orginally read in wrong
colnames(retail_dataset)[[5]] <- "InvoiceDateAndTime"

The first section of this R Notebook performs a Association Rule Mining Analysis.

The following code converts the retail dataset into a transactions object which can be used in the ‘arules’ package. This function uses the entire product decription as the item obtained in the transaction.

transactions<-convert_to_transactions(retail_dataset,description_start_end=c(0,0)) 
itemFrequencyPlot(transactions, topN=10, col=brewer.pal(8,'Pastel1'), type="absolute", main="Item Frequency") 

rules1 <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.60)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 259 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4223 item(s), 25900 transaction(s)] done [0.45s].
## sorting and recoding items ... [591 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.04s].
## writing ... [179 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
plot(rules1, engine = "htmlwidget")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules1, method = "graph", engine = "htmlwidget")
## Warning: Too many rules supplied. Only plotting the best 100 rules using
## lift (change control parameter max if needed)
transactions_2<-convert_to_transactions(retail_dataset,description_start_end=c(3,6))
rules2 <- apriori(transactions_2, parameter = list(supp = 0.005, conf = 0.70)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 129 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[724 item(s), 25900 transaction(s)] done [0.04s].
## sorting and recoding items ... [280 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
plot(rules2, method = "graph", engine = "htmlwidget")

The next section of this R notebook performs Kmeans and Hierarchical Clustering for customer segmentation.

The first step is to create a customer profile vector for each customer based off their purchasing habits. This code chunk contains only some basic features as a start. The following features are used per customer: total number of transactions, average quantity purchased, average transaction cost, and the number of distinct weeks a transaction was placed (for a consistancy feature).

retail_dataset_mutated <- retail_dataset %>% mutate(transaction_cost = UnitPrice*Quantity) %>%
  separate(InvoiceDateAndTime, c("InvoiceDate","InvoiceTime")," ")

retail_dataset_mutated$InvoiceDate <- as.Date(retail_dataset_mutated$InvoiceDate,"%m/%d/%Y")

retail_dataset_mutated$DateFirstOrder <- with(retail_dataset_mutated, ave(InvoiceDate,CustomerID, FUN = min))
retail_dataset_mutated$DateLastOrder <- with(retail_dataset_mutated, ave(InvoiceDate,CustomerID, FUN = max))

retail_dataset_mutated$week <- week(retail_dataset_mutated$InvoiceDate)

customer_profiles <- retail_dataset_mutated %>% group_by(CustomerID) %>% 
  summarise(avg_cost_of_transaction = mean(transaction_cost),avg_quantity_bought = mean(Quantity),total_transactions = n(),
            distinct_weeks_of_a_transaction = n_distinct(week)) %>% select(-CustomerID)

Focus in on customers who don’t spend very large amounts or buy very large quantities.

customer_profiles_no_outliers<-customer_profiles %>% filter(between(avg_quantity_bought, 0,50), between(avg_cost_of_transaction,0, 60),between(total_transactions, 1,50))

Below I use Kmeans clustering with k=4 based on the customer profiles.

k_means <- kmeans(customer_profiles_no_outliers,4)

plot(k_means,data=customer_profiles_no_outliers)

I now do some Hierarchical Clustering using only average cost of transaction and number of total transactions as customer profile features.

library(stats)

# dont use consistancy feature or quantity feature for visualization purposes
customer_profiles_subset <- customer_profiles_no_outliers %>% select(avg_cost_of_transaction, total_transactions)

clusters<-hclust(dist(customer_profiles_subset)) 

clusterCut <- cutree(clusters, 3)

ggplot(customer_profiles_subset, aes(avg_cost_of_transaction, total_transactions)) + 
  geom_point() + geom_point(col = clusterCut) + 
  scale_color_manual(values = c('black', 'red', 'green'))